Improving Syntax-Augmented Machine Translation by Coarsening the Label Set
نویسندگان
چکیده
We present a new variant of the SyntaxAugmented Machine Translation (SAMT) formalism with a category-coarsening algorithm originally developed for tree-to-tree grammars. We induce bilingual labels into the SAMT grammar, use them for category coarsening, then project back to monolingual labeling as in standard SAMT. The result is a “collapsed” grammar with the same expressive power and format as the original, but many fewer nonterminal labels. We show that the smaller label set provides improved translation scores by 1.14 BLEU on two Chinese– English test sets while reducing the occurrence of sparsity and ambiguity problems common to large label sets.
منابع مشابه
Automatic Category Label Coarsening for Syntax-Based Machine Translation
We consider SCFG-basedMT systems that get syntactic category labels from parsing both the source and target sides of parallel training data. The resulting joint nonterminals often lead to needlessly large label sets that are not optimized for an MT scenario. This paper presents a method of iteratively coarsening a label set for a particular language pair and training corpus. We apply this label...
متن کاملمدل ترجمه عبارت-مرزی با استفاده از برچسبهای کمعمق نحوی
Phrase-boundary model for statistical machine translation labels the rules with classes of boundary words on the target side phrases of training corpus. In this paper, we extend the phrase-boundary model using shallow syntactic labels including POS tags and chunk labels. With the priority of chunk labels, the proposed model names non-terminals with shallow syntactic labels on the boundaries of ...
متن کاملSyntax-Augmented Machine Translation using Syntax-Label Clustering
Recently, syntactic information has helped significantly to improve statistical machine translation. However, the use of syntactic information may have a negative impact on the speed of translation because of the large number of rules, especially when syntax labels are projected from a parser in syntax-augmented machine translation. In this paper, we propose a syntax-label clustering method tha...
متن کاملA New Subtree-Transfer Approach to Syntax-Based Reordering for Statistical Machine Translation
In this paper we address the problem of translating between languages with word order disparity. The idea of augmenting statistical machine translation (SMT) by using a syntax-based reordering step prior to translation, proposed in recent years, has been quite successful in improving translation quality. We present a new technique for extracting syntax-based reordering rules, which are derived ...
متن کاملThe CMU syntax-augmented machine translation system: SAMT on Hadoop with n-best alignments
We present the CMU Syntax Augmented Machine Translation System that was used in the IWSLT-08 evaluation campaign. We participated in the Full-BTEC data track for Chinese-English translation, focusing on transcript translation. For this year’s evaluation, we ported the Syntax AugmentedMT toolkit [1] to the HadoopMapReduce [2] parallel processing architecture, allowing us to efficiently run exper...
متن کامل